Introduction

This file produces a set of basic quality checks to highlight potential issues in data collection. The survey used in this report is the 2018 Mozambique SDI survey.

Missing Values

To start, the following figures and tables highlight missing values for a few of our key indicators. For now, ignore the missing school knowledge, operational management, management skills, instructional leadership, and ECD scores; this information was not fully available in the SDI data.

Teacher absence is the most problematic indicator (missing for 79 of 338 schools in the summary table below), followed by infrastructure (75), inputs (69), and content knowledge (69).

Below the missing-values plot is a table of summary statistics for a few key indicators. It shows the min, 25th percentile, median, 75th percentile, max, mean, standard deviation, total number of schools, and number of schools with missing information for each variable. The underlying data are aggregated to the school level, and the means reported are raw (unweighted) means; weighted means will be produced in the report. These figures are meant to give only a basic idea of the data.

Summary Statistics and Counts of Missing Values for Key Dashboard Indicators
var                            min     q25  median   q75    max       mean          sd    n  number_missing
4th Grade Student Assessment     1  13.000      25  47.5   87.0  32.043011  22.5968327  338              59
Teacher Absence                  0   0.000      22  50.0  100.0  32.768340  35.2619269  338              79
Teacher Assessment              10  34.000      41  49.0   71.0  41.479554  11.4421220  338              69
TEACH Pedagogy Score            13  25.125      29  33.5   47.5  29.553309   6.3067589  338              63
Basic Inputs                     0   1.000       2   2.0    3.0   1.793680   0.8029778  338              69
Basic Infrastructure             0   1.000       1   2.0    4.0   1.553232   0.8190358  338              75
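As a rough illustration of how a table like the one above could be built, the following base R sketch computes the same columns for a small made-up data set. The data frame `schools`, its column names, its values, and the helper `summarise_indicator` are all hypothetical; this is not the code used to produce the report.

```r
# Hypothetical school-level data; names and values are illustrative only.
schools <- data.frame(
  student_knowledge = c(12, 45, NA, 30, 87),
  absence_rate      = c(0, NA, 50, 22, NA)
)

# Summary statistics for one indicator. Missing values are excluded from the
# statistics but counted separately; n is the total number of schools.
summarise_indicator <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  c(min            = min(x, na.rm = TRUE),
    q25            = q[1],
    median         = q[2],
    q75            = q[3],
    max            = max(x, na.rm = TRUE),
    mean           = mean(x, na.rm = TRUE),  # raw, unweighted mean
    sd             = sd(x, na.rm = TRUE),
    n              = length(x),
    number_missing = sum(is.na(x)))
}

# One row per indicator, one column per statistic.
summary_table <- t(sapply(schools, summarise_indicator))
print(round(summary_table, 2))
```

Counting missing values separately from `n` mirrors the layout of the table, where `n` is the total number of schools rather than the number of non-missing observations.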

Interactive Map

In the map below, users may click on specific provinces or regions to examine missing indicators. The slider controls which schools appear, based on the number of missing indicators. For instance, setting the slider to 4 keeps only schools that are missing four or more indicators, which indicates a relatively severe missing data problem. In the future, I may also include checkboxes for specific survey supervisors, to examine whether any particular supervisors perform worse than others. I could also add filters for the day the survey took place.

The map is color coded. Green markers, for instance, denote schools with no missing information on our six key indicators: 4th grade student achievement, teacher absence, teacher content knowledge, teacher pedagogy (TEACH), basic inputs, and basic infrastructure. Black markers denote schools that are missing all six indicators. More indicators can be added to this list, but for now this is what we could produce from the SDI data before our own data collection.
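The slider and color logic described above can be sketched in base R as follows. This is only an illustration under assumptions: the data frame, its column names, and the threshold variable `min_missing` are hypothetical, and the color used for partially missing schools ("orange") is a placeholder, since only the green and black categories are described here.

```r
# Hypothetical school-level data with the six key indicators.
schools <- data.frame(
  school_id             = 1:4,
  student_knowledge     = c(40, NA, NA, 55),
  absence_rate          = c(10, NA, 20, 0),
  content_knowledge     = c(41, NA, 35, 50),
  pedagogical_knowledge = c(29, NA, NA, 31),
  inputs                = c(2, NA, NA, 1),
  infrastructure        = c(1, NA, NA, 2)
)

indicator_cols <- setdiff(names(schools), "school_id")

# Number of missing indicators per school.
schools$n_missing <- unname(rowSums(is.na(schools[indicator_cols])))

# Green: nothing missing; black: all six missing.
# "orange" for everything in between is an assumed placeholder color.
schools$marker_colour <- ifelse(
  schools$n_missing == 0, "green",
  ifelse(schools$n_missing == length(indicator_cols), "black", "orange")
)

# Equivalent of moving the slider to 4: keep schools missing 4 or more.
min_missing <- 4
flagged <- schools[schools$n_missing >= min_missing, ]
print(flagged[, c("school_id", "n_missing", "marker_colour")])
```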

Outlier Plots

In the following, we highlight schools whose practice indicators are out of line with their 4th grade learning outcomes. A simple model is estimated relating 4th grade student learning to our practice indicators: teacher absence, teacher content knowledge, teacher pedagogy (TEACH), basic inputs, and basic infrastructure. The learning outcomes are then compared to the predicted values from this model. For instance, if a school scores poorly on teacher absence, content knowledge, pedagogy, inputs, and infrastructure, and thus has a low predicted value for student achievement, but in fact has very high student achievement, this may signal a problem with the quality of the data for that school. This is meant to be merely a first check of the data and does not necessarily indicate a problem.

We model the fraction correct on the student achievement exam using the logistic functional form:

\[E(A_i|X_i)=\frac{e^{(\beta_0 + \beta_1X_i)}}{1+e^{(\beta_0 + \beta_1 X_i)}}\]

where \(A_i\) is student achievement in fourth grade for school \(i\) and \(X_i\) is a vector of our practice indicators: teacher absence, teacher content knowledge, teacher pedagogy (TEACH), basic inputs, and basic infrastructure. The logistic functional form for the fraction correct was chosen because it can be justified using a very simple Rasch IRT model, in which the probability of answering each item correctly also follows the logistic functional form.
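One way to spell out the Rasch justification is the following sketch. The symbols \(\theta_i\) (ability), \(b_j\) (difficulty of item \(j\)), \(J\) (number of items), and \(\gamma_0, \gamma_1\) are introduced here for illustration only, and the common-difficulty assumption is a simplification. In a Rasch model, the probability of a correct answer to item \(j\) is

\[P(Y_{ij}=1 \mid \theta_i, b_j)=\frac{e^{\theta_i - b_j}}{1+e^{\theta_i - b_j}}\]

If we assume that all items share a common difficulty \(b\) and that ability is linear in the practice indicators, \(\theta_i = \gamma_0 + \gamma_1 X_i\), then the expected fraction correct is

\[E(A_i|X_i)=\frac{1}{J}\sum_{j=1}^{J} P(Y_{ij}=1 \mid X_i)=\frac{e^{(\gamma_0 - b) + \gamma_1 X_i}}{1+e^{(\gamma_0 - b) + \gamma_1 X_i}}\]

which matches the functional form above with \(\beta_0 = \gamma_0 - b\) and \(\beta_1 = \gamma_1\). Equivalently, the log-odds of the expected fraction correct are linear in the indicators:

\[\log\frac{E(A_i|X_i)}{1-E(A_i|X_i)}=\beta_0 + \beta_1 X_i\]

When item difficulties differ, the average of the item-level logistic curves is no longer exactly logistic, so the functional form should be read as an approximation.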

The map below is color coded to show where the gap between actual and predicted student achievement is largest. The slider allows users to filter on the squared error from the model (the squared difference between actual and predicted achievement) to look for school locations that may be concerning. Again, a large error does not necessarily mean there is a problem, but it should trigger more investigation.
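The squared-error filter can be sketched in R as below. The data are simulated, the variable names are illustrative, and fitting the model with `glm()` and a quasibinomial family is an assumption about how a fractional outcome might be handled (it gives the same point estimates as a binomial logit); it is not necessarily how the model in this report was fit. The 95th percentile cutoff is likewise an arbitrary example of a slider setting.

```r
set.seed(1)

# Simulated school-level data; values are illustrative only.
n <- 200
schools <- data.frame(
  absence_rate      = runif(n, 0, 100),
  content_knowledge = runif(n, 10, 70)
)
linear_index <- -1 - 0.01 * schools$absence_rate +
  0.02 * schools$content_knowledge
# Fraction correct on the student exam, constrained to (0, 1).
schools$student_knowledge_01 <- plogis(linear_index + rnorm(n, sd = 0.5))

# Logistic model for the fraction correct (assumed specification).
fit <- glm(student_knowledge_01 ~ absence_rate + content_knowledge,
           family = quasibinomial(link = "logit"), data = schools)

# Predicted achievement and squared error for each school.
schools$predicted     <- predict(fit, type = "response")
schools$squared_error <- (schools$student_knowledge_01 - schools$predicted)^2

# Equivalent of the slider: keep schools above a chosen error threshold.
threshold <- quantile(schools$squared_error, 0.95)
outliers  <- schools[schools$squared_error >= threshold, ]
print(head(outliers[order(-outliers$squared_error), ]))
```

Schools returned in `outliers` are candidates for a closer look, not confirmed data errors.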

Summary Statistics from Outlier Model

stargazer(logitmod, title="Logit Model of Outliers - Coefficients", type="html")
Logit Model of Outliers - Coefficients

                         Dependent variable:
                         student_knowledge_01
absence_rate             0.001
                         (0.005)
content_knowledge        0.004
                         (0.014)
pedagogical_knowledge    0.027
                         (0.026)
inputs                   0.177
                         (0.227)
infrastructure           0.242
                         (0.206)
Constant                 -2.456**
                         (0.961)
Observations             176
Log Likelihood           -98.226
Akaike Inf. Crit.        208.451
Note:                    *p<0.1; **p<0.05; ***p<0.01
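To make the coefficients concrete, the sketch below plugs the reported point estimates into the logistic formula for one made-up school. The indicator values for the school are hypothetical, chosen only to fall within the ranges shown in the summary statistics table.

```r
# Point estimates copied from the coefficient table.
coefs <- c(
  constant              = -2.456,
  absence_rate          = 0.001,
  content_knowledge     = 0.004,
  pedagogical_knowledge = 0.027,
  inputs                = 0.177,
  infrastructure        = 0.242
)

# Hypothetical school; the leading 1 multiplies the constant.
school <- c(
  constant              = 1,
  absence_rate          = 20,
  content_knowledge     = 40,
  pedagogical_knowledge = 30,
  inputs                = 2,
  infrastructure        = 1
)

# Linear index, then the logistic transformation from the model equation.
linear_index <- sum(coefs * school[names(coefs)])
predicted    <- exp(linear_index) / (1 + exp(linear_index))
print(round(predicted, 3))
```

For this school the predicted fraction correct is roughly 0.30. Apart from the constant, none of the coefficients is statistically significant at conventional levels, so such predictions are best read as a rough screening device.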